Project Group - 20¶
Members:
Sheikh Arfahmi Bin Sheikh Arzimi
Ewan Brett
Cedric Nissen
Nils Hollnagel
Luka Rehviašvili
Introduction¶
To keep people safe and improve the visitor experience at big events like SAIL 2025 in Amsterdam, crowd management must be done well. Poor monitoring of crowd density makes the situation hard to oversee and can lead to overcrowding and other problems. The 2010 Love Parade in Duisburg (21 dead, 652 injured), the 2021 Astroworld Festival in Houston (10 dead, 4900 people suffered serious injuries, of whom at least 732 were injured so severely that they required long-term medical treatment), and the 2022 Halloween crowd disaster in Seoul (146 dead, 150 injured) all show how important it is to monitor events closely and respond quickly when necessary. This project creates an interactive dashboard that lets the SAIL Crowd Monitoring Team (CMT) make decisions in real time, making it easier to regulate crowd levels. The dashboard was made for SAIL, but it is meant to be a modular tool that other big event planners can use. It is part of a larger crowd management plan and is meant to work alongside other crowd control measures.
The research objective of the project is to develop an interactive dashboard that aids effective crowd flow management for the SAIL 2025 event in Amsterdam. Interactivity is essential since the operational context requires quick judgements based on a wide range of constantly changing inputs (including vessel movements, road traffic, and crowd indicators). A static report will not do: operators need to drill into an issue, filter it by time and place, and watch trends as they evolve. The dashboard does not replace policy or control on the ground; rather, it brings several pieces of information together into one clear view that practitioners can act on immediately.
By tying the work to SAIL 2025, it gains a real-world setting with specific geography, traffic patterns, and stakeholders. This makes it more relevant and easier to test, while keeping the solution modular so it can be reused at future events. The objective is shaped by the decision environment (real-time monitoring), the data's characteristics (large, noisy, and only intermittently available), and the need for a tool that supports quick, informed action.
Research Questions¶
These questions break the objective (see section “Research Objective”) into three parts that work together: (i) governance, to make sure the dashboard can be used safely in an operational setting; (ii) representation, to make complex, quickly changing data easy to understand; and (iii) prediction, to use the available datasets to forecast how crowd flow will change.
RQ1. Which measure can be used to enable secure access control? This makes sure that only people who are authorised to see or act on sensitive, live information may do so.
RQ2. How can crowd flow be visualised most effectively? This is about choosing map and time-series views that show density and movement effectively while still meeting performance needs.
RQ3. Which algorithm is suitable to predict crowd flow based on the available datasets? This aims to find a method that is fit for purpose and strikes a balance between accuracy, timeliness, and robustness, taking into account the size and variety of the data.
Data Used¶
This project uses operational feeds and contextual geodata to let SAIL 2025 staff monitor crowd flow in near real time. The following datasets were used to bring the dashboard to life:
- Crowd flow (sensor counts) - provided by faculty researchers. 46 fixed sensors that report bidirectional counts every 3 minutes. Used for real-time monitoring and as the target variable in the forecasting module.
- Vessel positions - Vesselposition_data_20-24Aug2025.csv (from faculty researchers). Point observations with lon, lat, timestamp, vessel ID, and speed. Split into chunks and normalised to UTC, then shown in a sliding window (15 minutes by default).
- Car flow - TomTom_data_20-24Aug2025.csv (from faculty researchers). Records come with a date and a nested data field. They are flattened to (time, id, traffic_level) and joined to the Dutch road network for map visualisation.
- Geospatial setting: a. NWB_roads.zip for information on the NWB road network, such as road geometry and segment IDs; this zip file was also provided by the faculty researchers. b. Basemaps (OpenStreetMap and ArcGIS) for map context. c. A GeoJSON layer (TRAMMETRO PUNTEN_2025) showing tram and metro stops around the event area, available on Amsterdam Open Data (https://maps.amsterdam.nl/open_geodata/geojson_latlng.php?KAARTLAAG=TRAMMETRO_PUNTEN_2025&THEMA=trammetro).
- Weather (KNMI) - CSVs including temperature, dew point, air pressure, wind speed, max gust, rainfall, and sunshine duration. These were combined with sensor counts into crowd_weather_merged.csv for the prediction trials.
- Authentication: a small, local store user_database.pkl for the Streamlit login page (this is a prototype for access control).
Answering RQ1: User Authentication¶
1.1 Introduction¶
To answer research question RQ1 (Which measure can be used to enable secure access control?), several user authentication methods were investigated. The first, the OpenID Connect method, enables users to log into a Streamlit page using existing Google or Microsoft accounts. As a second option, a widely applied username-and-password solution was evaluated. The latter option was selected as it allows a step-by-step approach to constructing code that is easy to follow for the purposes of this project. This mainly holds because all functionalities are Streamlit-based and no external services need to be involved. The development of the helper and main functions that implement the approach in Streamlit can be reviewed in the User_Authentication file.
1.2 Login Interface¶
The Login Interface forms the basis for any user interaction. Assuming the user has an existing account (either predefined by a system admin or self-created through Sign-Up), the user is guided through the login procedure. Upon successful completion of the process, full access to all dashboard pages is granted. The user stays logged in until they decide to log out again.
1.3 Sign-Up Interface¶
If the entered username is non-existent, the system advises the user to either check for spelling errors or create a new account by signing up. The newly entered credentials are then stored in a pickle database. To enhance security, passwords are always hashed before storage, so no plain-text password ever reaches the database.
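The sign-up and hashing logic can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the salted PBKDF2 scheme, and the in-memory `users` dictionary are assumptions; the real implementation only shares the pickle store (`user_database.pkl`) and the hash-before-storage principle.

```python
import hashlib
import os
import pickle

def hash_password(password, salt=None):
    """Return (salt, digest); the password itself is never stored."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def sign_up(users, username, password):
    """Register a new user; refuse duplicates, as on the Sign-Up page."""
    if username in users:
        return False
    users[username] = hash_password(password)
    return True

def verify(users, username, password):
    """Re-hash the attempt with the stored salt and compare digests."""
    if username not in users:
        return False
    salt, digest = users[username]
    return hash_password(password, salt)[1] == digest

def save_users(users, path="user_database.pkl"):
    """Persist the credential store to the local pickle database."""
    with open(path, "wb") as f:
        pickle.dump(users, f)
```

Because only salted digests are stored, reading the pickle file does not reveal any password.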
1.4 Access Control across the Dashboard¶
As the SAIL2025 Crowd Management Dashboard features multiple pages (e.g. Homepage, Settings Page, etc.), user authentication must guarantee that all pages can only be accessed after a user has successfully logged in. This is achieved by calling a function (check_login_status()) before any content of any page is loaded. To showcase this functionality, the screenshot below illustrates the error message shown to the user when trying to access pages without logging in.
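The gatekeeper idea can be sketched in plain Python. Here a dictionary stands in for Streamlit's `st.session_state`; in the app, the real check_login_status() would call `st.error(...)` and `st.stop()` instead of raising an exception, and the key names are assumptions.

```python
class NotLoggedInError(Exception):
    """Raised when a page is opened without a valid login."""

def check_login_status(session):
    """Gatekeeper called at the top of every page before content loads.

    `session` plays the role of st.session_state, which Streamlit keeps
    alive across page switches and reruns.
    """
    if not session.get("logged_in", False):
        raise NotLoggedInError("Please log in via the Home page first.")
    return session.get("username")
```

Calling this as the very first statement of every page guarantees that no dashboard content is rendered for unauthenticated visitors.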
Furthermore, the user is directed straight to the home page, where the login page is displayed by default. They can then make use of the above-explained Login and Sign-Up Interfaces.
Answering RQ2: Data Visualisation¶
2.1 Introduction¶
Visualising the data provided by the sensors dotted around the inner city and port of Amsterdam allows the user to identify urgent situations and act accordingly to mitigate any potential risk. Displaying past data further allows users to identify patterns in crowd behaviour. The type of visualisation chosen must be appropriate for its intended purpose. As a result, different types of map visualisations for live data monitoring and a line graph illustrating past data were employed. In the following sections, the various methods employed to visualise crowd data are explained in detail.
2.2 Crowd Flow¶
To not only display the crowd count but also relate it to the effective width and the three-minute time interval, the crowd flow is calculated. The function "calculate_crowd_flow" divides the crowd count by the effective width of the respective sensor and then by the 3-minute interval. The resulting crowd flow depicts the relative intensity at the location of the sensor. The function takes a timestamp as input and calculates the resulting crowd flow for that timestamp. Finally, a dictionary with the crowd flow is returned, which can be appended to dataframes or displayed directly on a map or graph. Each sensor direction is calculated and returned separately.
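The calculation can be sketched as below. This is a simplified stand-in for the project's calculate_crowd_flow: the per-direction keys and the dictionary shapes are illustrative assumptions, but the arithmetic (count divided by effective width, divided by the 3-minute interval) follows the description above.

```python
INTERVAL_MIN = 3  # sensors report counts every three minutes

def calculate_crowd_flow(counts, widths):
    """Crowd flow per sensor direction, in persons per metre per minute.

    `counts` maps a (sensor, direction) key to the 3-minute crowd count;
    `widths` maps the same key to the sensor's effective width in metres.
    """
    flow = {}
    for key, count in counts.items():
        flow[key] = count / widths[key] / INTERVAL_MIN
    return flow
```

For example, 90 people passing a 5 m wide sensor in 3 minutes yields a flow of 6 persons per metre per minute, making sensors with different widths directly comparable.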
2.3 Time-sensitivity, simulating live data¶
To orient the dashboard as closely as possible to the real-world scenario, we developed an auto-refresh. It uses a session state to update the dashboard every three minutes (the frequency at which the data was updated), so the most recent data received is always shown. For simulation purposes, the refresh rate was sped up to 5 seconds: on each refresh, the programme reads the next line of the CSV file, and this continues even while other pages are being viewed. On top of this, a feature was added that preserves the position of the map (zoom and area of focus) across refreshes, so it does not jump back to any default settings.
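The row-advancing mechanism behind the simulation can be sketched as follows. A dictionary again stands in for `st.session_state` (which survives Streamlit reruns), and the key name `row_idx` is an assumption; in the app a timer component triggers the rerun every few seconds.

```python
def next_frame(session, rows):
    """Advance the simulated live feed by one CSV row per refresh.

    `session` persists across page reloads (like st.session_state);
    `rows` is the pre-loaded list of CSV records. The index is clamped
    so the feed stays on the final record once the file is exhausted.
    """
    idx = session.get("row_idx", 0)
    frame = rows[min(idx, len(rows) - 1)]
    session["row_idx"] = min(idx + 1, len(rows) - 1)
    return frame
```

Because the counter lives in the shared session state rather than in any one page, the feed keeps advancing consistently while the user navigates between pages.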
2.4 Map Visualisation¶
Regarding visualisations, we chose three main types: bidirectional arrows, circles, and a heat map. Each serves a different purpose. At each measuring location, two directions are counted, 180 degrees apart. The bidirectional arrows therefore make it easy to visualise the count at each sensor location for both directions, giving a quick overview per location per direction. The heat map shows a clearer overview across the whole area, making it easier to spot high- and low-density zones. In terms of colour coding, the bidirectional arrows were split into 4 groups based on the maximum counts within the respective dataset. The heat map uses the standard colour scale: green/blue means low intensity and red means high intensity. To show both crowd count and crowd flow, a toggle was created that allows switching between the two modes. The same visualisations are shown for each mode, for consistency when viewing the data.
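The four-group colour coding for the arrows can be sketched as a simple bucketing function. The quartile thresholds relative to the dataset maximum and the specific colours are assumptions for illustration; the dashboard derives its own cut-offs from the data.

```python
def count_to_colour(count, max_count):
    """Map a crowd count to one of four colour groups, scaled by the
    maximum count observed in the dataset."""
    ratio = count / max_count if max_count else 0.0
    if ratio < 0.25:
        return "green"
    if ratio < 0.5:
        return "yellow"
    if ratio < 0.75:
        return "orange"
    return "red"
```

Scaling by the dataset maximum keeps the grouping meaningful even when absolute counts differ greatly between quiet and busy days.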
2.5 Multi-User Settings¶
A settings page was created to keep user preferences. This allows, for instance, one user to open the dashboard on the heat map visualisation while another user opens it on a different default view. The preferences are saved per user, on their account.
2.6 Crowd Data Graph¶
The page "Crowd Data Graph" displays the crowd count as a line graph. The graph shown below displays data from a dataframe that updates live, adding the data from the latest timestamp. For this purpose, the function "add_new_row" is used. Similarly to the map explained above, the graph page uses the auto-refresh function, employing a session state that reloads the graph every three minutes. The directions of the sensors are displayed separately.
The page allows the user to select and deselect sensors, which are displayed in different colours. Additionally, when hovering over the graph, the precise timestamp and count are displayed.
2.7 Car Flow¶
The TomTom feed (TomTom_data_20-24Aug2025.csv) records traffic conditions on each road segment as (time, id, traffic_level), although the pairs arrive in a single nested data field. The goal was to turn this into a table and a map that clearly show the state of the network.
Analysis¶
The nested data field is flattened into tidy rows of (time, id, traffic_level), with types coerced and timestamps set to UTC (displayed in Europe/Amsterdam). To make the app more responsive, records are binned into 3-minute frames and written to a small snapshot (CSV.GZ/Parquet); the app only reloads when the source file changes. The NWB road network (a zipped shapefile) provides segment geometry. The appropriate NWB identifier column is chosen by maximising overlap with the TomTom IDs in the current frame, and matching features are kept and styled by traffic_level.
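The flattening step can be sketched with the standard library. The exact layout of the nested field in the real feed is not documented here, so this sketch assumes each record carries an ISO timestamp and a JSON-encoded list of segment readings; field names (`time`, `data`, `id`, `traffic_level`) are illustrative.

```python
import json
from datetime import datetime, timezone

def flatten_tomtom(records):
    """Flatten nested TomTom records into tidy (time, id, traffic_level)
    tuples, with timestamps normalised to UTC."""
    rows = []
    for rec in records:
        ts = datetime.fromisoformat(rec["time"]).astimezone(timezone.utc)
        for seg in json.loads(rec["data"]):
            rows.append((ts, seg["id"], float(seg["traffic_level"])))
    return rows
```

One input record thus expands into one tidy row per road segment, which is the shape the snapshot and the map join expect.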
Result¶
The page has one interactive map that shows NWB segments with observations in the current 3-minute frame. Traffic levels are indicated by line colour: green, yellow, orange, and red. Hovering or clicking reveals a short tooltip with the road ID and the (rounded) traffic level. The display automatically advances every three minutes and moves forward as new rows are added to the source file. At the bottom, a status summary reports the auto-play cadence, the current frame time, the roads displayed, and the data range covered by the snapshot. To avoid misleading geometry, segments without a reliable ID match are left out.
2.8 Vessel Data¶
The vessel stream (Vesselposition_data_20-24Aug2025.csv) gives point observations such as longitude, latitude, timestamp, identifier, and speed where available. The dashboard shows a near-real-time picture: the most recent position of each vessel in the last 15 minutes, not a full historical replay.
Analysis¶
Only the operational columns are taken from the larger raw file (Vesselposition_data_20-24Aug2025.csv), which also contains radar and orientation information. These columns are lon, lat, upload timestamp, id, and speed in centimetres per second (converted to m/s for display). Timestamps are set to UTC and presented in the Europe/Amsterdam timezone. A 15-minute sliding frame, anchored on the file's most recent timestamp or a selected anchor, picks out recent records. Within this window, only the most recent record for each vessel is kept to give a clear current view. When the underlying file changes, the page reprocesses.
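The sliding-window selection can be sketched as follows. The tuple layout and function name are assumptions for illustration; the logic (anchor on the newest timestamp, drop records older than 15 minutes, keep only the latest row per vessel) matches the description above.

```python
from datetime import datetime, timedelta

def latest_positions(records, window_minutes=15):
    """Keep each vessel's most recent position inside the sliding window.

    `records` are (timestamp, vessel_id, lon, lat) tuples; the window is
    anchored at the newest timestamp present in the file.
    """
    if not records:
        return {}
    anchor = max(ts for ts, *_ in records)
    cutoff = anchor - timedelta(minutes=window_minutes)
    latest = {}
    for ts, vid, lon, lat in sorted(records):
        if ts >= cutoff:
            latest[vid] = (ts, lon, lat)  # later rows overwrite earlier ones
    return latest
```

Iterating in chronological order means the dictionary naturally ends up holding the last-seen position of every vessel still inside the window.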
Result¶
On the map, blue dots show the vessels' positions within the event region. Hovering over or clicking a vessel opens a small tooltip with the vessel ID, local time, exact longitude and latitude, and speed in metres per second. A sidebar multiselect of vessel IDs enables focused tracking: any selected vessels are highlighted in red while the others stay blue, making it easier to follow targets without losing sight of the big picture. A status footer gives a quick overview of the scene: how many vessels are currently displayed, the most recent file time in UTC to which the snapshot is anchored, the auto-refresh period (every three minutes), the 15-minute frame in use, and the path to the data source. Together, these parts give operators a real-time, operational view of what is happening on the water while keeping the interface snappy.
Answering RQ3: Crowd Count Prediction¶
3.1 Introduction¶
Crowd count prediction serves to estimate the number of people across space and time. In this case, space is divided based on the locations of the crowd count sensors, and time is split into 3-minute intervals based on the sensor update frequency. The goal is therefore to predict future crowd counts for each available sensor in multiples of 3-minute intervals (i.e. 3, 6, 9, ... minutes into the future). The Extreme Gradient Boosting (XGBoost) machine learning model was experimented with in this project. It was selected for its robustness and ability to capture non-linear relationships. The next few subsections outline the steps taken to develop this model.
3.2 Data Sources / Requirements / Preprocessing¶
The data sources used are: (i) Crowd Count from Sensors (by timestamp and sensor name), and (ii) KNMI Weather Data (by timestamp; includes Temperature, Dew Point, Air Pressure, Windspeed, Max Gust, Rainfall, Sunshine Duration). Both are comma-separated-value (.csv) files located in sensor_data and weather_data respectively. They were imported, inspected for missing values and cleaned. The data was then visualised to capture general trends. This process can be found in Read_Sensor_(Crowd)_Data and Read_KNMI_WeatherData. Both datasets were then merged into one single dataframe with timestamp set as index and saved as crowd_weather_merged.csv. The crowd count and weather trends are shown below.
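A minimal sketch of the merge step, assuming pandas. The column names and the use of `merge_asof` (so that each 3-minute sensor row picks up the most recent hourly weather observation at or before it) are illustrative assumptions; the notebooks mentioned above do more cleaning before this point.

```python
import pandas as pd

# Toy stand-ins for the cleaned sensor and KNMI dataframes.
crowd = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2025-08-20 12:00", "2025-08-20 12:03"]),
     "CMSA-GAKH-01_0": [42, 55]})
weather = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2025-08-20 12:00"]),
     "temperature": [21.5]})

# merge_asof fills each sensor row with the latest weather reading
# at or before its timestamp, bridging the hourly/3-minute mismatch.
merged = pd.merge_asof(crowd.sort_values("timestamp"),
                       weather.sort_values("timestamp"),
                       on="timestamp").set_index("timestamp")
```

The result has one row per sensor timestamp with weather columns attached, which is the shape saved as crowd_weather_merged.csv.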
From the almost sinusoidal curve of the crowd count, we can tell that the overall crowd count peaks daily in the late afternoon and dips slightly after midnight. This is the general crowd trend, but the distribution across sensors may vary greatly.
From the weather data, no general trends can be concluded. However, rainfall was zero for most of the event and thus, it can be hypothesised that rainfall will not be a strong feature to predict the crowd count.
3.3 Target and Features¶
The target 'Y' of the XGB model is the data to be predicted, i.e. the future crowd counts of all sensors. The features 'X' are the remaining data in crowd_weather_merged.csv. They include 'hour', 'minute', 'day', 'month', 'weekday', 'is_weekend', 'temperature', 'dew_point', 'air_pressure', 'wind_speed', 'max_gust', 'rainfall', 'sunshine_duration' and 'relative_humidity'.
Prior to training the XGB model, pre-ML data exploration was conducted to investigate the correlations between the different data types. This process can be followed in Pre_ML_DataExploration. The data exploration gives us insights into the expected results and feature importances of the trained machine learning model.
For example, in the correlation matrix among the sensors' crowd counts above, we can see a positive correlation between sensor pairs with opposing directions (CMSA-GAKH-01_0 and CMSA-GAKH-01_180). However, correlation is weak between sensors that are further apart. (Note that in the figures below, there is insufficient space to print all the y labels.)
The correlation matrix between weather data and crowd data was also plotted to see which features correlate strongly or weakly with the crowd data. As shown in the figure above, the 'hour' feature has the strongest overall correlation with the crowd count and is thus expected to be one of the most important features for the XGB model. Other potentially important features include 'temperature', 'wind_speed', 'max_gust', 'sunshine_duration' and 'relative_humidity'.
From the graph of Total Crowd Count over Time above, it was found that historical data may be able to predict current data, as the crowd count shows seasonality characteristics. To determine how far back in time crowd counts should be considered as features, an Autocorrelation Function (ACF) graph of CMSA-GAKH-01_0 was plotted. The x-axis indicates the time interval between the current observation and past ones, also known as the 'lag', while the y-axis represents the correlation between the current observation and the observation at a particular lag. The blue shaded area represents the 95% confidence interval; any bars that extend beyond (above/below) it are statistically significant.
From the graph above, it was observed that there are significant spikes at regular intervals of 200-250 steps. This confirms seasonality in the crowd data, yet not all of the lags are statistically significant. Therefore, another ACF was plotted to zoom into the lags that are statistically significant.
From the graph above, we can deduce that lags up to lag_75 are statistically significant and thus will be considered as features.
In addition to lag features, rolling mean features were also included. A rolling mean is the average crowd count over a fixed interval of historic data. Its purpose is to smooth out short-term volatility and reduce noise so that underlying, longer-term trends are revealed. The rolling mean intervals were chosen to capture short-term (3-5), medium-term (10-20) and long-term (30-60) trends. The additional features were created and then appended to the merged dataframe.
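Lag and rolling-mean feature construction can be sketched with pandas `shift` and `rolling`. This is a compact illustration with small default windows; the project itself uses lags up to 75 and the rolling windows listed above, and the column names here are assumptions.

```python
import pandas as pd

def add_history_features(series, lags=(1, 2, 3), windows=(3, 10, 30)):
    """Append lag and rolling-mean columns for one sensor's count series."""
    df = pd.DataFrame({"count": series})
    for k in lags:
        df[f"lag_{k}"] = df["count"].shift(k)
    for w in windows:
        # shift(1) ensures each window uses only past values, so no
        # information from the current interval leaks into its features
        df[f"roll_mean_{w}"] = df["count"].shift(1).rolling(w).mean()
    return df
```

The early rows are left as NaN (no history yet) and would be dropped before training.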
3.4 Training Procedure¶
The features and target were split chronologically, with data from 20-22 August 2025 assigned as training data and the remaining data (from 23 August 2025, 00:00 onwards) as test data. Random splitting was also experimented with in TS_2_RandomSplit_MultiOutputReg. However, that approach leaks future data into the prediction of past/current data and was thus rejected as unrealistic.
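The chronological split can be sketched as a single cutoff over timestamped rows. The function name and the (timestamp, features, target) tuple layout are assumptions for illustration; the point is that everything strictly before the cutoff trains the model and nothing from the future leaks back.

```python
from datetime import datetime

def chronological_split(rows, cutoff):
    """Split timestamped rows into train/test sets at a fixed cutoff.

    `rows` are (timestamp, features, target) tuples; rows before the
    cutoff go to training, rows at or after it to the held-out test set.
    """
    train = [r for r in rows if r[0] < cutoff]
    test = [r for r in rows if r[0] >= cutoff]
    return train, test
```

With the cutoff at 23 August 2025, 00:00, the model is always evaluated on data it could not have seen during training.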
Another option in the training strategy was to train one model per sensor (TS_1_RandomSplit), as opposed to one universal model for all sensors. However, upon quick inspection of the crowd counts of all the sensors, it was observed that most sensors follow the seasonality pattern and peak at similar times. Thus, a single model was preferred for ease of maintenance and to reduce overfitting.
3.5 Model Evaluation¶
The metrics used to evaluate the model are mean absolute error (MAE) and root-mean-squared error (RMSE). Instead of focusing on the performance of the model in general, the model was evaluated based on how well it predicts data for each location. This is more meaningful as it accounts for the differences between locations. The model performed best for GASA-06_95 with MAE = 0.080795 and RMSE = 0.087036.
On the other hand, it performed worst for GASA-02-02_135 with MAE = 25.207134 and RMSE = 58.591652.
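The two evaluation metrics are standard and can be written out directly; this sketch is only a reminder of their definitions, not project code.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: the average magnitude of prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root-mean-squared error: like MAE, but squaring penalises
    large misses more heavily."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
```

RMSE is always at least as large as MAE on the same errors; a big gap between the two (as for GASA-02-02_135) signals occasional large misses rather than uniformly mediocre predictions.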
The model's feature importance was validated against insights gained from the exploratory data analysis.
Based on the figure above, the mean of the past 3 time intervals (the past 9 minutes) was the most important feature in predicting the current crowd count. The top features are dominated by rolling mean and lag features, which matches the strong seasonality pattern observed across most locations. In general, weather features were less important than historical data, though temperature stands out as the most important among the weather features.
3.6 Deployment Strategy¶
The XGB model is stored locally as crowd_count_model.pkl and loaded in 4_Predictive_Analysis.py. The Streamlit app fetches live sensor data via load_live_sensor_data() in data_loader.py every 3 minutes. A function create_features() transforms historic and current data into model inputs. Predictions are generated on demand per sensor selection. By default, the forecasting window is set to 20 intervals (1 hour) and the prediction is performed recursively using the recursive_forecast() function. The results are plotted in an interactive line graph that shows historical, current and predicted counts. To visualise the accuracy of the predictions, actual future counts are also plotted. In terms of the user interface, the colour scheme was standardised across all graphs for the different types of data, with legends present to assist users. The default zoom focuses on the past hour plus the forecast horizon for immediate insight. The figure below shows a screenshot of the predictive analytics visualisation.
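The recursive rollout can be sketched as below. The simplistic feature builder (the last three counts) is a stand-in for the project's create_features(), and the `model.predict` interface is an assumption; the essential idea is that each prediction is appended to the history so the next step's lag features are built from predicted rather than observed values.

```python
def recursive_forecast(model, history, steps=20):
    """Roll a one-step model forward over the forecast horizon.

    `model` is any object with a predict(features) -> count method;
    20 steps of 3 minutes cover the default 1-hour window.
    """
    history = list(history)
    forecasts = []
    for _ in range(steps):
        features = history[-3:]          # simplistic lag features
        y_hat = model.predict(features)  # one-step-ahead prediction
        forecasts.append(y_hat)
        history.append(y_hat)            # feed the prediction back in
    return forecasts
```

A known trade-off of this scheme is that errors compound: later steps are predicted from earlier predictions, so accuracy typically degrades over the horizon.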
In terms of extensibility, the model can be updated periodically with new sensor and weather data.
3.7 Limitations of Model¶
The weather data from KNMI was recorded hourly, so shorter-term changes in weather conditions were not captured; this could explain the weak importance of weather features in predicting the crowd count. Additionally, more data, such as event timeline and location, vessel positions and car flow data, could be added to the pipeline to strengthen the model. In terms of training strategy, the trained XGB model was tested only on weekend data, which explains why the 'is_weekend' feature is not important at all. If the SAIL event were longer, with more data points, the train-test split could be arranged so that both training and testing data contain weekday and weekend data while still maintaining a chronological split. The hyperparameters of the XGB model could also be optimised with hyperparameter tuning, creating a model with the best performance for this use case; this was not done due to the project's time constraints. Lastly, to increase usability and user-friendliness, the predicted values for all sensors could be visualised on a map with an adjustable time horizon, and thresholds could be set to warn the crowd monitoring team of higher-than-acceptable crowd counts.
Conclusion¶
Concluding this project, all posed research questions were answered during the development of the dashboard. First, a user authentication process ensures that only legitimate users can access the dashboard (RQ1). Furthermore, a combination of different visualisation types aids crowd management operators in their daily operations (RQ2). The customisable map displaying current crowd flow data gives an overview, while deeper insights can be gained by navigating to the Crowd Flow Graphs and Vessel Positioning pages. The predictive analysis page helps operators develop a future crowd management strategy based on past developments (RQ3): they no longer need to rely solely on conclusions drawn from visualisations of the current situation but can also count on the ML model's insights.
Even though all RQs have been answered, there are several limitations and opportunities for further research. Regarding the user authentication process, it must be mentioned that the Sign-Up option currently allows any user to create a profile. In a real deployment, this feature should be disabled or extended with an additional verification step so that only legitimate users are granted access. Additionally, a password reset feature could be investigated and added to the system.
Looking at the Crowd Flow Visualisation, given the timeline of the project, there is potential to add further visuals for the crowd management operators to work with (e.g. total number of visitors). Additionally, a scrollbar would be handy for retrospective views of the crowd flow situation. The car and vessel flow data could be merged into the main page of the dashboard to make the information easier to take in. A notification feature would make the dashboard even more proactive and allow operators to focus on taking action rather than conducting dashboard-aided analysis.
In terms of Predictive Analysis, the variety of datasets used could be increased to improve prediction accuracy. In addition, the visualisation could be adapted so that the map visualisation (home page) also features predictive insights, which would increase user-friendliness. As for further research, other ML models besides the applied XGB could be applied and evaluated against each other.
Contribution Statement¶
**Author 1: Sheikh Arfahmi Bin Sheikh Arzimi**:
- Coding: Future Crowd Flow Prediction (ML Module), Tram and Metro, Settings Page
- Reporting: RQ3 Crowd Flow Predictions
- Project Management (Task Distribution)
**Author 2: Nils Hollnagel**:
- Coding: User Authentication
- Reporting: RQ1 User Authentication & Conclusion
- Project Management (Report Structure, Task Distribution, Group Organisation)
**Author 3: Cedric Nissen**:
- Coding: Crowd Data Graph Page, Crowd flow calculation and implementation (Home Page)
- Reporting: RQ2 Crowd Flow Visualisation
**Author 4: Ewan Brett**:
- Coding: Home page automatic updating, visualisations (heatmap, arrows, circles, default map settings)
- Reporting: RQ2 Crowd Flow Visualisation
**Author 5: Luka Rehviašvili**:
- Coding: Vessel Positioning, Car Flow Data Visualisation
- Reporting: Research Objective, Data Used, RQ2 Crowd Flow Visualisation